294 research outputs found
A complexity analysis of statistical learning algorithms
We apply information-based complexity analysis to support vector machine
(SVM) algorithms, with the goal of a comprehensive continuous algorithmic
analysis of such algorithms. This involves complexity measures in which some
higher order operations (e.g., certain optimizations) are considered primitive
for the purposes of measuring complexity. We consider classes of information
operators and algorithms made up of scaled families, and investigate the
utility of scaling the complexities to minimize error. We look at the division
of statistical learning into information and algorithmic components, at the
complexities of each, and at applications to support vector machine (SVM) and
more general machine learning algorithms. We give applications to SVM
algorithms graded into linear and higher order components, and give an example
in biomedical informatics.
On the probabilistic continuous complexity conjecture
In this paper we prove the probabilistic continuous complexity conjecture. In
continuous complexity theory, this states that the complexity of solving a
continuous problem with probability approaching 1 converges (in this limit) to
the complexity of solving the same problem in its worst case. We prove the
conjecture holds if and only if the space of problem elements is uniformly convex.
The non-uniformly convex case has a striking counterexample in the problem of
identifying a Brownian path in Wiener space, where it is shown that
probabilistic complexity converges to only half of the worst case complexity in
this limit.
The Marr Conjecture and Uniqueness of Wavelet Transforms
The inverse question of identifying a function from the nodes (zeroes) of its
wavelet transform arises in a number of fields. These include whether the nodes
of a heat or hypoelliptic equation solution determine its initial conditions,
and in mathematical vision theory the Marr conjecture, on whether an image is
mathematically determined by its edge information. We prove a general version
of this conjecture by reducing it to the moment problem, using a basis dual to
the Taylor monomial basis on .
Comment: 52 pages, 4 figures
On Some Integrated Approaches to Inference
We present arguments for the formulation of a unified approach to different
standard continuous inference methods from partial information. It is claimed
that an explicit partition of information into a priori (prior knowledge) and a
posteriori information (data) is an important way of standardizing inference
approaches so that they can be compared on a normative scale, and so that
notions of optimal algorithms become farther-reaching. The inference methods
considered include neural network approaches, information-based complexity, and
Monte Carlo, spline, and regularization methods. The model is an extension of
currently used continuous complexity models, with a class of algorithms in the
form of optimization methods, in which an optimization functional (involving
the data) is minimized. This extends the family of current approaches in
continuous complexity theory, which include the use of interpolatory algorithms
in worst and average case settings.
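As a minimal sketch of what "minimizing an optimization functional involving the data" can look like, the Python example below fits a single slope by regularized least squares. The functional, the function names, and the toy data are illustrative assumptions, not the model from the abstract.

```python
def functional(a, xs, ys, lam):
    """Assumed functional: data misfit plus a penalty on the parameter."""
    misfit = sum((y - a * x) ** 2 for x, y in zip(xs, ys))
    return misfit + lam * a ** 2


def fit_slope(xs, ys, lam):
    """Closed-form minimizer of functional() over the scalar slope a.

    Setting the derivative to zero gives a = sum(x*y) / (sum(x*x) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Toy data (hypothetical): exact line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
unregularized = fit_slope(xs, ys, lam=0.0)   # recovers the data slope
regularized = fit_slope(xs, ys, lam=14.0)    # shrunk toward zero
```

Here the data play the role of the a posteriori information, while the penalty weight `lam` encodes a prior preference for small slopes.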
Relationships among Interpolation Bases of Wavelet Spaces and Approximation Spaces
A multiresolution analysis is a nested chain of related approximation
spaces. This nesting in turn implies relationships among interpolation bases in
the approximation spaces and their derived wavelet spaces. Using these
relationships, a necessary and sufficient condition is given for existence of
interpolation wavelets, via analysis of the corresponding scaling functions. It
is also shown that any interpolation function for an approximation space plays
the role of a special type of scaling function (an interpolation scaling
function) when the corresponding family of approximation spaces forms a
multiresolution analysis. Based on these interpolation scaling functions, a new
algorithm is proposed for constructing corresponding interpolation wavelets
(when they exist in a multiresolution analysis). In simulations, the theorems
on the existence and construction of interpolation wavelets in a general
multiresolution analysis are tested on several typical wavelet spaces.
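To make the idea of an interpolation scaling function concrete, the sketch below uses the piecewise-linear hat function, a standard textbook example chosen here as an assumption (the abstract does not name it). It equals 1 at zero and 0 at every other integer, and it satisfies a two-scale refinement relation.

```python
def hat(x):
    """Piecewise-linear hat function: 1 at x = 0, 0 at all other integers."""
    return max(0.0, 1.0 - abs(x))


def refine(x):
    """Right-hand side of the two-scale relation for the hat function:
    hat(x) = 0.5*hat(2x + 1) + hat(2x) + 0.5*hat(2x - 1).
    """
    return 0.5 * hat(2 * x + 1) + hat(2 * x) + 0.5 * hat(2 * x - 1)


def interpolate(samples, x):
    """Interpolant sum_k f(k) * hat(x - k) for samples given at integer nodes.

    samples: dict mapping integer node k to the value f(k).
    """
    return sum(value * hat(x - k) for k, value in samples.items())


# Hypothetical sample values at nodes 0, 1, 2.
samples = {0: 1.0, 1: 3.0, 2: 2.0}
at_node = interpolate(samples, 1)      # reproduces the sample at the node
between = interpolate(samples, 0.5)    # linear blend of neighboring samples
```

Because the function vanishes at all nonzero integers, the expansion coefficients are the sample values themselves, which is the property that makes an interpolation basis convenient.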
Transcription Factor-DNA Binding Via Machine Learning Ensembles
We present ensemble methods in a machine learning (ML) framework combining
predictions from five known motif/binding site exploration algorithms. For a
given TF the ensemble starts with position weight matrices (PWMs) for the
motif, collected from the component algorithms. Using dimension reduction, we
identify significant PWM-based subspaces for analysis. Within each subspace a
machine classifier is built for identifying the TF's gene (promoter) targets
(Problem 1). These PWM-based subspaces form an ML-based sequence analysis tool.
Problem 2 (finding binding motifs) is solved by agglomerating k-mer (string)
feature PWM-based subspaces that stand out in identifying gene targets. We
approach Problem 3 (binding sites) with a novel machine learning approach that
uses promoter string features and ML importance scores in a classification
algorithm locating binding sites across the genome. For target gene
identification this method improves performance (measured by the F1 score) by
about 10 percentage points over (a) the motif scanning method and (b) the
coexpression-based association method. The top motif outperformed the five component
algorithms as well as two other common algorithms (BEST and DEME). For
identifying individual binding sites on a benchmark cross-species database
(Tompa et al., 2005), we match the best performer without much human
intervention. The method also improves performance on mammalian TFs.
The ensemble can integrate orthogonal information from different weak
learners (potentially using entirely different types of features) into a
machine learner that can perform consistently better for more TFs. The TF gene
target identification component (Problem 1 above) is useful in constructing a
transcriptional regulatory network from known TF-target associations. The
ensemble is easily extendable to include more tools as well as future PWM-based
information.
Comment: 33 pages
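The two basic ingredients mentioned above, scoring a promoter sequence against a PWM and evaluating target predictions with the F1 score, can be sketched in Python as follows. The matrix values, the uniform background frequency, and all function names are hypothetical; this is not the ensemble described in the abstract.

```python
import math

# Hypothetical 3-position PWM: probability of each base at each motif position.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
BACKGROUND = 0.25  # assumed uniform background base frequency


def log_odds(window):
    """Sum of per-position log2(p / background) for one motif-length window."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PWM, window))


def best_window_score(sequence):
    """Highest log-odds score over all motif-length windows of the sequence."""
    k = len(PWM)
    return max(log_odds(sequence[i:i + k]) for i in range(len(sequence) - k + 1))


def f1_score(predicted, actual):
    """Harmonic mean of precision and recall for boolean predictions."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if a and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


score = best_window_score("TTAGCTT")  # the window "AGC" matches the motif
f1 = f1_score([True, True, False, False], [True, False, True, False])
```

A classifier of the kind described would use many such scores as features; the single best-window score here only illustrates how a PWM turns a sequence into a number.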
BowSaw: inferring higher-order trait interactions associated with complex biological phenotypes
Machine learning is helping the interpretation of biological complexity by enabling the inference and classification of cellular, organismal and ecological phenotypes based on large datasets, e.g. from genomic, transcriptomic and metagenomic analyses. A number of available algorithms can help search these datasets to uncover patterns associated with specific traits, including disease-related attributes. While, in many instances, treating an algorithm as a black box is sufficient, it is interesting to pursue an enhanced understanding of how system variables end up contributing to a specific output, as an avenue towards new mechanistic insight. Here we address this challenge through a suite of algorithms, named BowSaw, which takes advantage of the structure of a trained random forest algorithm to identify combinations of variables (“rules”) frequently used for classification. We first apply BowSaw to a simulated dataset, and show that the algorithm can accurately recover the sets of variables used to generate the phenotypes through complex Boolean rules, even under challenging noise levels. We next apply our method to data from the integrative Human Microbiome Project and find previously unreported high-order combinations of microbial taxa putatively associated with Crohn’s disease. By leveraging the structure of trees within a random forest, BowSaw provides a new way of using decision trees to generate testable biological hypotheses.
Accepted manuscript
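The general idea of reading variable combinations off the trees of a forest can be illustrated with a toy Python sketch. It counts which sets of variables appear together on the decision paths that lead samples to a given class. The tree encoding, the hand-built forest, and the function names are assumptions made for illustration; this is not the BowSaw algorithm itself.

```python
from collections import Counter


def leaf(label):
    return {"label": label}


def split(var, left, right):
    """Internal node: go right when the binary variable is 1, else left."""
    return {"var": var, "left": left, "right": right}


def path_variables(tree, sample):
    """Return (sorted variables tested on the sample's path, leaf label)."""
    used = []
    node = tree
    while "label" not in node:
        used.append(node["var"])
        node = node["right"] if sample[node["var"]] else node["left"]
    return tuple(sorted(set(used))), node["label"]


def frequent_rules(forest, samples, target_label=1):
    """Count variable sets on paths that classify samples as target_label."""
    counts = Counter()
    for sample in samples:
        for tree in forest:
            variables, label = path_variables(tree, sample)
            if label == target_label:
                counts[variables] += 1
    return counts


# Hypothetical hand-built forest over binary variables a, b, c.
forest = [
    split("a", leaf(0), split("b", leaf(0), leaf(1))),
    split("b", leaf(0), split("a", leaf(0), leaf(1))),
    split("c", leaf(0), leaf(1)),
]
samples = [
    {"a": 1, "b": 1, "c": 0},
    {"a": 1, "b": 0, "c": 0},
    {"a": 1, "b": 1, "c": 1},
    {"a": 0, "b": 0, "c": 1},
]
rules = frequent_rules(forest, samples)
top_rule = rules.most_common(1)[0]
```

In this toy forest the pair (a, b) is the most frequently used combination for the positive class, which is the kind of candidate "rule" one would then test against the data.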